Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent large-scale text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. To benchmark existing models, we introduce a large-scale challenge dataset SR2D that contains sentences describing two objects and the spatial relationship between them. We construct and harness an automated evaluation pipeline that employs computer vision to recognize objects and their spatial relationships, and we employ it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding that, although recent state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations such as left/right/above/below. Our analyses demonstrate several biases and artifacts of T2I models such as the difficulty with generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study that shows the alignment between VISOR and human judgment about spatial understanding. We offer the SR2D dataset and the VISOR metric to the community in support of T2I spatial reasoning research.
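As a hedged illustration of how such an automated pipeline can verify a described relation, the sketch below compares the centroids of two detected bounding boxes; the relation names, coordinate convention, and decision rule are illustrative assumptions, not the released VISOR implementation.

```python
def box_centroid(box):
    """Centroid of a bounding box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def spatial_relation(box_a, box_b):
    """Dominant spatial relation of object A with respect to object B.

    A minimal sketch: the relation is decided by the larger displacement axis
    between the two box centroids (image coordinates, y grows downward).
    """
    (ax, ay), (bx, by) = box_centroid(box_a), box_centroid(box_b)
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "right of" if dx > 0 else "left of"
    return "below" if dy > 0 else "above"


def relation_is_correct(detections, obj_a, obj_b, expected_relation):
    """Check whether a generated image depicts the relation described in text.

    `detections` maps object names to bounding boxes from an object detector;
    the relation counts as correct only if both objects are found and their
    centroid-based relation matches the text.
    """
    if obj_a not in detections or obj_b not in detections:
        return False  # a missing object already violates the prompt
    return spatial_relation(detections[obj_a], detections[obj_b]) == expected_relation


# Example: "a dog to the left of a bicycle"
dets = {"dog": (10, 40, 60, 90), "bicycle": (120, 30, 200, 100)}
print(relation_is_correct(dets, "dog", "bicycle", "left of"))  # True
```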
'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform 'Reasoning about Actions & Change' (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. CLEVR_HYP (Sampat et al., 2021) is one such testbed for hypothetical vision-language reasoning with actions as the key focus. In this work, we propose a novel learning strategy that can improve reasoning about the effects of actions. We implement an encoder-decoder architecture to learn the representation of actions as vectors. We combine this encoder-decoder architecture with existing modality parsers and a scene-graph question answering model to evaluate our proposed system on the CLEVR_HYP dataset. We conduct thorough experiments to demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
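A minimal PyTorch sketch of the kind of encoder-decoder that learns action representations as vectors; the tokenization, layer sizes, and reconstruction objective are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ActionEncoderDecoder(nn.Module):
    """Toy encoder-decoder that compresses an action description into a vector.

    The encoder maps a tokenized action sentence to a fixed-size latent vector;
    the decoder reconstructs the token sequence from that vector, forcing the
    latent code to capture what the action does.
    """

    def __init__(self, vocab_size=1000, emb_dim=64, latent_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, latent_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, vocab_size)

    def forward(self, tokens):
        emb = self.embed(tokens)          # (batch, seq, emb_dim)
        _, h = self.encoder(emb)          # h: (1, batch, latent_dim)
        action_vector = h.squeeze(0)      # the learned action representation
        dec_out, _ = self.decoder(emb, h) # teacher-forced reconstruction
        return self.out(dec_out), action_vector

model = ActionEncoderDecoder()
tokens = torch.randint(0, 1000, (2, 12))  # two dummy action sentences
logits, action_vec = model(tokens)
print(action_vec.shape)                   # torch.Size([2, 128])
```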
'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform 'Reasoning about Actions & Change' (RAC). Recently, there has been growing interest in the study of RAC with visual and linguistic inputs. Graphs are often used to represent the semantic structure of visual content (i.e., objects, their attributes, and the relationships among them), commonly referred to as scene graphs. In this work, we propose a novel method that leverages the scene-graph representation of images to reason about the effects of actions described in natural language. We experiment with the existing CLEVR_HYP (Sampat et al., 2021) dataset and show that our proposed approach is effective in terms of performance, data efficiency, and generalization capability compared to existing models.
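To make the scene-graph idea concrete, the sketch below represents a scene as objects plus relations and applies a parsed action as a graph update; the action types and attribute names are hypothetical, not CLEVR_HYP's actual schema.

```python
# A tiny scene graph: objects with attributes, plus directed spatial relations.
scene_graph = {
    "objects": {
        "cube_1":   {"color": "red",  "size": "small"},
        "sphere_1": {"color": "blue", "size": "large"},
    },
    "relations": [("cube_1", "left_of", "sphere_1")],
}

def apply_action(graph, action):
    """Apply a parsed action to the scene graph and return the updated graph.

    Only two hypothetical action types are handled here: repainting an object
    and moving it to the other side of a reference object.
    """
    kind = action["type"]
    if kind == "paint":
        graph["objects"][action["target"]]["color"] = action["color"]
    elif kind == "move_right_of":
        graph["relations"] = [r for r in graph["relations"] if r[0] != action["target"]]
        graph["relations"].append((action["target"], "right_of", action["reference"]))
    return graph

# "Paint the small cube green, then move it to the right of the sphere."
apply_action(scene_graph, {"type": "paint", "target": "cube_1", "color": "green"})
apply_action(scene_graph, {"type": "move_right_of", "target": "cube_1", "reference": "sphere_1"})
print(scene_graph["relations"])  # [('cube_1', 'right_of', 'sphere_1')]
```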
Videos often capture objects, their visible properties, their motion, and the interactions between different objects. Objects also have physical properties such as mass, which the imaging pipeline is unable to directly capture. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. In this paper, we introduce CRIPP-VQA, a new video question answering dataset for reasoning about the implicit physical properties of objects in a scene. CRIPP-VQA contains videos of objects in motion, annotated with questions that involve counterfactual reasoning about the effect of actions, questions about planning in order to reach a goal, and descriptive questions about visible properties of objects. The CRIPP-VQA test set enables evaluation under several out-of-distribution settings -- videos with objects with masses, coefficients of friction, and initial velocities that are not observed in the training distribution. Our experiments reveal a surprising and significant performance gap in terms of answering questions about implicit properties (the focus of this paper) and explicit properties of objects (the focus of prior work).
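The claim that implicit properties can be estimated from collision dynamics follows from conservation of momentum: for an isolated two-body collision, m1*Δv1 = -m2*Δv2, so the ratio of observed velocity changes reveals the unobservable mass ratio. A toy sketch under idealized one-dimensional, friction-free assumptions (not the dataset's simulator):

```python
def relative_mass(delta_v_1, delta_v_2):
    """Infer the mass ratio m1 / m2 of two colliding objects.

    Conservation of momentum for an isolated two-body collision gives
    m1 * delta_v_1 = -m2 * delta_v_2, so the ratio of velocity changes
    determines the (unobservable) mass ratio.
    """
    return -delta_v_2 / delta_v_1

# Object 1 slows from 2.0 to 0.5 m/s; object 2 speeds up from 0.0 to 1.0 m/s.
ratio = relative_mass(delta_v_1=0.5 - 2.0, delta_v_2=1.0 - 0.0)
print(f"m1 is {ratio:.2f}x as heavy as m2")  # m1 is 0.67x as heavy as m2
```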
'Actions' play a vital role in how humans interact with the world and achieve their desired goals. As a result, much of human commonsense (CS) knowledge revolves around actions. While 'Reasoning about Actions & Change' (RAC) has been widely studied in the Knowledge Representation community, it has recently attracted growing interest from NLP and computer vision researchers. This paper surveys existing tasks, benchmark datasets, various techniques and models, and their respective performance toward the advancement of RAC in the vision and language domain. Finally, we summarize our key takeaways, discuss the current challenges facing this research area, and outline potential directions for future research.
Table question answering (TQA) is an important but under-explored task. Most existing QA datasets use unstructured text formats, and only a few use tables as context. To the best of our knowledge, no TQA dataset exists for the biomedical domain, where tables are frequently used to present information. In this paper, we first curate a table question answering dataset, BioTabQA, using 22 templates and contexts from a biomedical textbook on differential diagnosis. BioTabQA can not only be used to teach models how to answer questions from tables but also to evaluate how models generalize to unseen questions, an important scenario for biomedical applications. To enable this generalization evaluation, we split the templates into 17 for training and 5 for cross-task evaluation. We then develop two baselines on BioTabQA using single-task and multi-task learning. In addition, we explore instruction learning, a technique that has shown impressive generalization performance. Experimental results show that our instruction-tuned model outperforms the single-task and multi-task baselines on average across various evaluation settings and, more importantly, outperforms them by 5% on cross-task evaluation.
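As a hedged sketch of template-based dataset construction, the snippet below instantiates a question template against table rows; the table content and template wording are invented for illustration and are not drawn from BioTabQA.

```python
# One (invented) row of a differential-diagnosis style table.
table = [
    {"disease": "Influenza", "key_symptom": "fever", "typical_onset": "sudden"},
    {"disease": "Common cold", "key_symptom": "runny nose", "typical_onset": "gradual"},
]

# A question template instantiated against one column of the table.
TEMPLATE = "What is the typical onset of {disease}?"

def generate_qa_pairs(rows, template, answer_column):
    """Instantiate a template against each table row to produce (question, answer) pairs."""
    return [(template.format(**row), row[answer_column]) for row in rows]

for question, answer in generate_qa_pairs(table, TEMPLATE, "typical_onset"):
    print(question, "->", answer)
# What is the typical onset of Influenza? -> sudden
# What is the typical onset of Common cold? -> gradual
```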
To succeed in single-source domain generalization, maximizing the diversity of synthesized domains has emerged as one of the most effective strategies. Many recent successes come from methods that pre-specify the types of diversity a model is exposed to during training, so that it can ultimately generalize well to new domains. However, naive diversity-based augmentations do not work, either because they cannot model large domain shifts or because the span of pre-specified transformations does not cover the types of shift commonly encountered in domain generalization. To address this problem, we propose a novel framework that uses adversarially learned transformations (ALT), parameterized by a neural network, to model plausible yet hard image transformations that fool the classifier. This network is randomly initialized for every batch and trained for a fixed number of steps to maximize classification error. In addition, we enforce consistency between the classifier's predictions on clean and transformed images. Through extensive empirical analysis, we find that this new form of adversarial transformation simultaneously achieves the goals of diversity and hardness and outperforms all existing techniques on competitive benchmarks for single-source domain generalization. We also show that ALT can naturally work with existing diversity modules, producing highly distinct source domains and leading to state-of-the-art performance.
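A minimal PyTorch-style sketch of the adversarial-transformation idea described above; the transformation network, step count, loss weights, and optimizer settings are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def alt_training_step(classifier, images, labels, adv_steps=5, lr_adv=1e-3, lam=1.0):
    """One sketched ALT update: learn a hard transformation, then train the classifier.

    1. A small image-to-image transformation network is randomly re-initialized
       for this batch.
    2. It is trained for a few steps to *maximize* the classifier's loss,
       producing plausible but hard transformed images.
    3. The classifier loss combines classification on clean and transformed
       images with a consistency term tying the two predictions together.
    """
    transform_net = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
    ).to(images.device)
    opt_t = torch.optim.SGD(transform_net.parameters(), lr=lr_adv)

    # Adversarial phase: shape the transformation to fool the classifier.
    for _ in range(adv_steps):
        loss_adv = -F.cross_entropy(classifier(transform_net(images)), labels)
        opt_t.zero_grad()
        loss_adv.backward()
        opt_t.step()
    classifier.zero_grad()  # discard gradients accumulated while fitting the transform

    # Classifier phase: clean + transformed supervision plus prediction consistency.
    adv_images = transform_net(images).detach()
    logits_clean, logits_adv = classifier(images), classifier(adv_images)
    consistency = F.kl_div(F.log_softmax(logits_adv, dim=1),
                           F.softmax(logits_clean, dim=1), reduction="batchmean")
    return (F.cross_entropy(logits_clean, labels)
            + F.cross_entropy(logits_adv, labels)
            + lam * consistency)
```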
Language models demonstrate both quantitative improvements and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to billions of parameters. In addition, a team of human expert raters performed all tasks to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; and social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
In the open-book question answering (OBQA) task, selecting relevant passages and sentences out of distracting information is crucial for reasoning about the answer to a question. The HotpotQA dataset is designed to teach and evaluate systems on passage ranking and sentence selection. Many existing frameworks use separate models to select relevant passages and sentences respectively. Such systems not only have high complexity in terms of model parameters but also fail to exploit the benefits of training the two tasks together, since one task can be beneficial to the other. In this work, we propose a simple yet effective framework that addresses these limitations by jointly ranking passages and selecting sentences. Furthermore, we propose consistency and similarity constraints to promote the correlation and interaction between passage ranking and sentence selection. Experiments show that our framework achieves results competitive with previous systems and outperforms them by 28% in terms of exactly matching the relevant sentences on the HotpotQA dataset.
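As a hedged sketch of joint training with a consistency constraint, the snippet below combines passage and sentence losses and ties each passage score to the best score among its own sentences; the exact constraints and weighting in the paper may differ.

```python
import torch
import torch.nn.functional as F

def joint_loss(para_logits, para_labels, sent_logits, sent_labels,
               sent_to_para, lam_cons=0.1):
    """Joint loss for passage ranking and sentence selection with a consistency term.

    para_logits / para_labels: (num_passages,) scores and 0/1 float relevance labels
    sent_logits / sent_labels: (num_sentences,) scores and 0/1 float relevance labels
    sent_to_para: (num_sentences,) long tensor; index of the passage each sentence is in
    """
    loss_para = F.binary_cross_entropy_with_logits(para_logits, para_labels)
    loss_sent = F.binary_cross_entropy_with_logits(sent_logits, sent_labels)

    # Consistency: a passage's score should agree with its best sentence's score.
    best_sent = torch.stack([
        sent_logits[sent_to_para == p].max()
        if (sent_to_para == p).any() else sent_logits.new_tensor(-1e9)
        for p in range(para_logits.shape[0])
    ])
    consistency = F.mse_loss(torch.sigmoid(para_logits), torch.sigmoid(best_sent))

    return loss_para + loss_sent + lam_cons * consistency
```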
We present a robotic assembly system that streamlines the design-to-manufacture workflow, from CAD models of product components to a fully programmed and adaptive assembly process. Our system captures (within the CAD tool) the intent of the assembly process for a specific robotic workcell and generates a recipe of task-level instructions. By combining visual sensing with deep-learned perception models, the robots infer from the generated recipe the actions needed to assemble the design. The perception models are trained directly from simulation, allowing the system to identify individual parts based on CAD information. We demonstrate the system with a two-robot workcell assembling interlocking 3D part designs. We first build and tune the assembly process in simulation and verify the generated recipe. Finally, the real robotic workcell assembles the design using the same behaviors.
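For intuition only, a "recipe" of task-level instructions can be represented as an ordered list of steps; the field names and operations below are hypothetical placeholders rather than the system's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class AssemblyStep:
    """One task-level instruction in a (hypothetical) generated assembly recipe."""
    operation: str      # e.g. "pick", "place", "insert"
    part_id: str        # part identifier taken from the CAD model
    target_pose: tuple  # (x, y, z, roll, pitch, yaw) in the workcell frame
    robot: str = "robot_1"

@dataclass
class AssemblyRecipe:
    design_name: str
    steps: list = field(default_factory=list)

recipe = AssemblyRecipe(
    design_name="interlocking_puzzle",
    steps=[
        AssemblyStep("pick",   "part_A", (0.40, 0.10, 0.05, 0, 0, 0)),
        AssemblyStep("insert", "part_A", (0.55, 0.20, 0.02, 0, 0, 1.57), robot="robot_2"),
    ],
)
for step in recipe.steps:
    print(step.robot, step.operation, step.part_id)
```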